Student Name: Daniele Pennacchia
Project Title: Is it all about WHERE we live?
For this Data Science project I downloaded the Melbourne Housing Market dataset from Kaggle. The dataset covers the housing market of Melbourne, the capital of the Australian state of Victoria, and was scraped from publicly available results on Domain.com.au, as the author of the Kaggle page (https://www.kaggle.com/anthonypino/melbourne-housing-market) reports. The final dataset provides 34857 observations of sold houses across 21 features (or variables), from January 2016 until March 2018.
The domain I would like to focus on is the house pricing market, in order to investigate the dynamics behind it and find some insights both through a time series analysis (for instance, whether there was a decline in house prices and when it occurred, or whether there is an increasing/decreasing trend over the period) and in an exploratory way, trying to predict future house prices or to find clustering insights.
I plan the following strategy steps and analysis:
As already mentioned, after loading the data we have a large dataset with 34857 observations (rows) and 21 variables (columns).
Unfortunately, the dataset is not tidy and presents several missing values in almost all of its features. Thus, after a first data exploration we will need to convert some feature data types, perform some data transformations (we might add extra information as new features), clean and impute data, and carry out some feature engineering.
# This cell is where you can copy + paste your Python code which loads your data and produces output
# When you press CTRL+Enter, this cell will execute and produce some output
# You can develop your code in Spyder (or another IDE) and copy + paste over here
# Step-1: Load your data
# Step-2: Get an overview of the data features, some suggestions to look for:
# number of features, data types, any missing values?,
# any transformations needed?
# Start with your import (s) here.
# The following is essential to get your matplotlib plots inline, so do not miss this one if you have graphics.
%matplotlib inline
# Some usual imports here
import csv as csv
import numpy as np
import pandas as pd
# Continue here with your code
#Import libraries (and change the directory too)
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn #during this coursework we will import whatever we need for ML analysis (e.g. from sklearn.linear_model import LinearRegression)
# conda install -c conda-forge folium geopandas tornado==4.5.3 #pre-install for folium; run this in the cmd prompt every time
# import folium #(heatmaps and other visualisation tools; we would need to import some plugins)
import datetime
import os
#os.chdir(r'C:\Users\Daniele\Desktop\City - MSc Data Science\PDS\Coursework') #wd of my personal laptop
os.chdir(r'\\nsq023vs\u6\aczc947\Desktop\PrinciplesDataScience\Coursework') #wd of my City account
import warnings
warnings.filterwarnings("ignore")
#Load data
df_housing = pd.read_csv('../Coursework/Melbourne_housing_FULL.csv')
pd.set_option('display.max_columns', None) #in order to display all the var/columns
df_housing.head()
Suburb: Suburb
Address: Address
Rooms: Number of rooms
Type: h - house, cottage, villa, semi, terrace; u - unit, duplex; t - townhouse.
Price: Price in Australian dollars of a house
Method: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.
SellerG: Real Estate Agent
Date: Date related to the house transaction (i.e. sold, etc.)
Distance: Distance from the Central Business District (CBD) in kilometres
Postcode: Self explanatory
Bedroom2 : Scraped # of Bedrooms (from different source)
Bathroom: Number of Bathrooms
Car: Number of carspots
Landsize: Land size in square metres
BuildingArea: Building size in square metres
YearBuilt: Year the house was built
CouncilArea: Governing council for the area
Lattitude: Self explanatory
Longtitude: Self explanatory
Regionname: General Melbourne Region (West, North West, North, North east ...etc)
Propertycount: Number of properties that exist in that specific suburb.
# Checking info of variables (type + NaN in particular), then sorted (descending order)
df_housing.info()
df_housing.isnull().sum().sort_values(ascending=False)
# Percentage of missing values
df_housing.isnull().sum()/len(df_housing)*100
Some data types do not seem correctly assigned. For example Postcode, Bedroom2, Bathroom, Car and Propertycount are float, while they should be categorical (Postcode) or integer (all the others).
We tend to prefer categorical variables over object type (to reduce memory usage).
Also, Date should be converted to a date format, and we can split this variable into "Year", "Month" and "Day" for future investigation (feature engineering part). We also found quite a lot of missing values in the following features: BuildingArea, YearBuilt and Landsize.
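As a sketch of the planned Date split, assuming the file stores day-first date strings (the frame below is a hypothetical stand-in for df_housing, not the real data):

```python
import pandas as pd

# Hypothetical mini-frame standing in for df_housing; the real Date column
# holds day-first strings such as "3/12/2016"
demo = pd.DataFrame({'Date': ['3/12/2016', '4/02/2017', '17/03/2018']})

# dayfirst=True is needed because the strings use d/m/Y ordering
demo['Date'] = pd.to_datetime(demo['Date'], dayfirst=True)

# Split into Year / Month / Day via the .dt accessor
demo['Year'] = demo['Date'].dt.year
demo['Month'] = demo['Date'].dt.month
demo['Day'] = demo['Date'].dt.day
print(demo[['Year', 'Month', 'Day']])
```

The same three `.dt` lines would apply unchanged to df_housing once its Date column is parsed.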
# Changing type of variables:
# create a subset of all object variables that we want to transform into categorical
obj_to_cat = ['Suburb', 'Address', 'Type', 'Method','SellerG', 'CouncilArea','Regionname']
for colname in obj_to_cat:
    df_housing[colname] = df_housing[colname].astype('category')
# Convert Date to datetime (the dates are day-first strings, so dayfirst=True avoids misparsing)
df_housing['Date'] = pd.to_datetime(df_housing['Date'], dayfirst=True)
# Convert Postcode (float) to categorical
df_housing['Postcode']=df_housing['Postcode'].astype('category')
#change Bathroom, Car and Propertycount to integer. Before the transformation I need to convert all the missing values to 0.
# Check confirmed changes
df_housing.info()
Just a bit of initial exploratory analysis: investigate different features to determine the relationships between the main ones (in particular with respect to the variable Price).
# pairplot to see the pairwise relations (joint relationships and histrograms between univariate distributions)
sns.pairplot(df_housing.dropna())
If we zoom into this pairplot, we can see along the diagonal the distribution of each numerical variable, while above and below the diagonal we have the joint pairwise distributions.
It is interesting to note that the main features are fairly right-skewed, as we could expect, with low/medium values predominant (such as houses with 3 rooms and 1 or 2 bathrooms).
Looking at the pairwise distributions, we can note how Rooms is rather linear with respect to Bedroom2; perhaps the second one could be redundant. Another feature correlated with the number of rooms is Bathroom, which of course seems plausible.
Price seems to be (positively) influenced by the number of rooms, bedrooms, bathrooms and the building area (this last distribution should be rescaled for a better view of the relationship), while the greater the distance from the city centre, the lower the price of a house.
# Features relationship with Price
feat = ['Rooms','Bathroom','Bedroom2', 'Distance','Car', 'Landsize', 'BuildingArea', 'YearBuilt']
for i in feat:
    sns.lmplot(data=df_housing, x=i, y='Price')
    plt.show()
for i in feat:
    sns.boxplot(x=i, y='Price',data=df_housing)
    plt.show()
Viewing both the linear regression plots (lmplot) and the boxplots, we can easily spot the linear relation between Price and the other variables.
Briefly, the price of a house tends to increase as the number of rooms approaches six, while after six rooms the average price slightly decreases. Finally, with 10, 12 and 16 rooms (one observation) the price tends to increase further. Nevertheless, we also have to point out that most of the variability and outliers are among houses with a lower number of rooms and bathrooms.
We can clearly spot an outlier in the YearBuilt feature.
The same pattern holds for bathrooms (price increasing up to 5 bathrooms); then the average price (and its variability) fluctuates for houses with 6 to 9 bathrooms.
The number of car spots does not seem to really influence the price of a house, while distance has a negative impact on price.
Finally, we can point out a constant (limited) ascending trend between price and building area.
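The rooms-vs-price pattern described above can also be checked numerically with a groupby; a minimal sketch on hypothetical toy data (on the real data the call would be `df_housing.groupby('Rooms')['Price'].agg(['mean', 'count'])`):

```python
import pandas as pd

# Toy stand-in for df_housing with just the two columns involved
toy = pd.DataFrame({
    'Rooms': [2, 3, 3, 6, 6, 7, 10],
    'Price': [600_000, 900_000, 1_000_000, 2_500_000, 2_400_000, 2_300_000, 3_000_000],
})

# Mean price and observation count per room number; on the real data this
# makes the "peak around six rooms" claim easy to verify, and the count
# column shows how few observations back the high-room-count averages
summary = toy.groupby('Rooms')['Price'].agg(['mean', 'count'])
print(summary)
```

The `count` column is worth keeping alongside the mean, since the extreme room counts rest on one or two observations each.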
# re-scaling Landsize & BArea
# but first I have to remove NAs from BuildingArea and Landsize
# (note: assigning the dropna'd Series back aligns on the index, so the NaNs simply remain; regplot ignores them anyway)
df_housing['BArea_clean'] = df_housing['BuildingArea'].dropna()
df_housing['BArea_clean'].describe()
ax =sns.regplot(data=df_housing, x='BArea_clean', y='Price')
plt.xlim(0,600) #limit set up to mean value + stdev
plt.ylim(0,10000000) #the most expensive house is sold at 11.6mil
plt.xlabel('Building Area cleaned')
plt.title('Price vs BuildingArea ')
plt.show()
# re-scaling Landsize w/o NAs
df_housing['Land_clean'] = df_housing['Landsize'].dropna()
df_housing['Land_clean'].describe() #23047 rows (ok)
ax =sns.regplot(data=df_housing, x='Land_clean', y='Price')
plt.xlim(0,4000) #limit set up to mean value + stdev
plt.ylim(0,10000000) #the most expensive house is sold at 11.6mil that's why we set 10mil as ymax
plt.xlabel('Landsize cleaned')
plt.title('Price vs Landsize ')
plt.show()
With the zoomed-in views and the help of basic statistics, we can summarise (for now) two main aspects of BuildingArea and Landsize:
Regarding distribution, both are right-skewed, but apart from outliers, the 75th percentile is within 188m$^2$ and 670m$^2$ respectively, so most houses have a "standard" size.
In terms of correlation, BuildingArea has a slight positive trend with Price, especially for houses up to 400m$^2$. Although Landsize has a lower correlation with Price (next to nothing), houses with a landsize in the 0-1500m$^2$ range might suggest a small increasing trend.
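Those visual impressions can be made quantitative with a Pearson correlation; `Series.corr` drops NaN pairs automatically, which matters here given the many missing BuildingArea values. A minimal sketch on hypothetical toy data (the real call would be `df_housing['BuildingArea'].corr(df_housing['Price'])`):

```python
import pandas as pd

# Toy frame with a missing BuildingArea value, mimicking the real data's NaNs
toy = pd.DataFrame({
    'BuildingArea': [100.0, 150.0, None, 200.0, 250.0],
    'Price':        [500_000, 700_000, 650_000, 900_000, 1_100_000],
})

# Series.corr silently ignores rows where either value is NaN,
# so the incomplete third row does not distort the coefficient
r = toy['BuildingArea'].corr(toy['Price'])
print(round(r, 3))
```

In the toy data the complete pairs are perfectly linear, so r comes out as 1.0; on the real data the coefficient would quantify the "slight positive trend" noted above.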
#Let's drop these 2 columns for now. We will investigate these features further in the Data Cleaning section
df_housing.drop(columns=['BArea_clean','Land_clean'], inplace=True, axis=1)
df_housing.head()
# plotting categorical features with Price
cat_feat = ['Type','Regionname','CouncilArea','Method']
for i in cat_feat:
    f, ax = plt.subplots(1,1, figsize = (8,10))
    sns.boxplot(x=i, y='Price',data=df_housing)
    plt.xticks(rotation=90)
    plt.title('%s vs Price' %i, fontsize = 16)
    plt.xlabel(i, fontsize = 14)
    plt.ylabel('Price', fontsize = 14)
    plt.show()
# count, unique values, mode value and frequency of categorical features
print(df_housing[cat_feat].describe())
#Plot per Region
plt.figure(figsize = (12,12))
sns.boxplot(x='Type', y='Price',data=df_housing ,hue='Regionname')
plt.title('Type vs Price per Region', fontsize = 16)
plt.xlabel('Type of house', fontsize = 14)
plt.ylabel('Price', fontsize = 14)
plt.show()
# Create some output (descriptive statistics) for Type, Regionname, CouncilArea and Method
for i in cat_feat:
    print(df_housing['Price'].groupby(df_housing[i]).describe())
From the boxplots and the summary statistics tables we can summarise:
The modal value for the type of house is "h" (house, cottage or villa), with a median price over 1 million. Then there are townhouses ("t" type), with a median price around 800k-900k, even though townhouses make up the smallest proportion of the dataset. Units or duplexes ("u") are the least expensive, with a median price of approximately 500k.
The Metropolitan Region has higher house prices compared to the Victoria Region, with Southern Metropolitan being the area with the highest median home price. This region also has the highest density of houses, and the city of Melbourne is included in this area.
The outcome for the Council area is a direct consequence of the previous one: the council areas with the highest median price are in the Metropolitan Region.
Home prices under the different selling methods are relatively similar across the board. The Vendor Bid method seems to have a little more variability and more outliers than the others, as well as a higher mean and median value.
Earlier, we noticed that there are 2 features that could be very similar (Rooms and Bedroom2). Moreover, the description of the second variable is not very clear ("Scraped # of Bedrooms (from different source)") and it may contain some typing/import errors. Let's have a closer look:
#Exploring correlation between features
cor_mat = df_housing.corr()
f, ax = plt.subplots(figsize=(14, 12))
plt.title('Correlation matrix of numeric Features', size =18)
sns.heatmap(cor_mat, vmax=.9, square=True, linewidths=.4 ,annot=True,cmap ='BuPu', annot_kws={'size':12});
The above correlation matrix shows the pairwise correlation between all the numeric features.
Price seems to be correlated with Rooms, Bathroom, Bedroom2, Distance and YearBuilt (negatively with the last two).
Focusing only on Rooms and Bedroom2, we can spot a very high correlation (0.95) that suggests a possible redundancy.
# Let's plot "Rooms" with "Bedroom2"
sns.lmplot(data= df_housing, x='Rooms', y='Bedroom2')
As we can see, the relation between these 2 variables is almost linear. We can notice some outliers (Bedroom2 values of 20 and 30 corresponding to Rooms values of 3, while one observation has 1 Bedroom2 and 9 Rooms).
# Let's look at those obs with higher values (possible outliers) of Room and Bedroom2
Data_view = df_housing[(df_housing["Bedroom2"] == 20) | (df_housing["Bedroom2"] == 30) | (df_housing["Rooms"] == 9)]
Data_view
Here we have evidence that there were probably some typing errors (for instance, 30 Bedroom2 should have been 3, with 1 bathroom instead of 12, while 20 Bedroom2 should have been only 2, which summed with 1 bathroom leads to 3 total rooms for that specific house).
Moreover, since these 2 variables are nearly linear and Bedroom2 has more missing values than Rooms, Bedroom2 will be removed from the dataset.
#Finally, we can drop the "Bedroom2" feature
df_housing = df_housing.drop(['Bedroom2'], axis=1)
# Let's visualise some basic statistics in order to explore the main features more closely
pd.set_option('display.max_columns', None)
df_housing.describe()
Striking outputs:
# create auxiliary columns from Bathroom and Car, in order to keep these new variables as float (after the following loop, "Bathroom" and "Car" will become object variables due to the insertion of "Missing" as a string)
df_housing['Bath_clean'] = df_housing['Bathroom'].dropna()
df_housing['Car_clean'] = df_housing['Car'].dropna()
num_feat = ['Rooms', 'Bathroom', 'Car']
for i in num_feat:
    df_housing[i] = df_housing[i].fillna('Missing')
    sns.countplot(data=df_housing, x=i)
    plt.xticks(rotation=60)
    plt.show()
df_housing.info() #after the previous loop, filling missing values with "Missing" turns Bathroom and Car from int/float into "object" (Rooms has no missing values, so it remains "int")
df_housing["Bathroom"].value_counts()
df_housing[(df_housing["Bathroom"] == 0) & (df_housing["Landsize"] <= 50)]
# we obtain 24 (out of 46) with no bath & no landsize (& NaN on BArea..)
We spot 9 houses with 7+ bathrooms, while at the same time there are 46 houses without a single bathroom. Let's have a deeper look:
From this subset of "no-bathroom houses" we can also highlight the absence of landsize and building area. One possibility is that these are very old houses with a shared bathroom outside the house, but since YearBuilt is also missing in this subset, we cannot be sure about this hypothesis.
Among these 24 observations, there are 6 that do not have a value for Price; we can delete those, while we will keep the others and try to fix them in the Imputation part of this project.
df_housing_old = df_housing.copy() #create a copy of the previous df
#delete these 6 rows
df_housing.drop(df_housing[(df_housing["Bathroom"] == 0) & (df_housing["Landsize"] <= 50) & (pd.isna(df_housing['Price']))].index, inplace=True)
df_housing = df_housing.reset_index(); #reset index in order to avoid missing "rows"
df_housing.info(); #check! it should be 34851 entries (from 34857)
df_housing.drop(columns=['index'], inplace=True, axis=1); #drop the index column
df_housing['Bath_clean'].describe()
df_housing[(df_housing["Bath_clean"] > 6) & (df_housing["Rooms"] < 7)]
Here we have 4 (out of 9) houses with more bathrooms than rooms (including the house with "12 bathrooms" already identified as a typing error). One curious fact is that the one without Landsize (and BuildingArea) is the most expensive among these houses.
Moreover, we can see that 3 out of 4 are old houses (built in 1950 or before) with a huge Landsize value. For now we decide to keep these observations.
#drop Bath_clean columns
df_housing.drop(columns=['Bath_clean'], inplace=True, axis=1)
df_housing["Car"].value_counts()
# try another method to visualise multiple values of the same feature (Car in this case)
# high values of "Car" (parking spots) in relation with Landsize (and Price: if these "outliers" have missing Price values we might decide to delete the observations)
for i in (9,10, 11,12,18,26): #values of Car greater than 8
    print(df_housing[(df_housing["Car"] == i)]) #are there typing errors? Is there a relationship between high values of car spots and Landsize?
df_housing["Landsize"].describe()
If we look at these 14 observations with high values of Car, we can easily spot that they correlate with high values of Landsize too. All but one are greater than the mean value of Landsize and bigger than the 3rd quartile.
The one with the highest Car value (26) is also the one with the smallest Landsize value in this subset. It also shows a missing value in Price (together with four other observations). We can proceed to eliminate these five rows (the ones without a Price value).
df_housing.drop(df_housing[(df_housing["Car_clean"] > 8) & (pd.isna(df_housing['Price']))].index, inplace=True)
df_housing = df_housing.reset_index(); #reset index in order to avoid missing "rows"
df_housing.info();
# Let's drop "Car_clean"column.
df_housing.drop(columns=['Car_clean'], inplace=True, axis=1)
df_housing.head();
df_housing.drop(columns=['index'], inplace=True, axis=1);
Moving on to BuildingArea and Landsize observations (extreme values).
# Infer about min/max values of BuildingArea and Landsize
df_housing['Landsize'].describe()
df_housing[df_housing['Landsize']>17600] #values > than mean+5*stdev
In this subset most houses have a missing value for BuildingArea, while 3 have a missing value for Price. If we combine those with NAs on both Price and BuildingArea, we obtain two observations (index 29982 and 32556) that we can delete.
Moreover, we see an unexpected value for observation 22624, which has a BuildingArea greater than its Landsize; we will try to fix this later. A striking Landsize outlier is also detected at observation 18028, with more than 43k square metres.
#drop these 2 obs
df_housing.drop(df_housing[(df_housing['Landsize']>17600) & (pd.isna(df_housing['BuildingArea'])) & (pd.isna(df_housing['Price']))].index, inplace=True)
df_housing = df_housing.reset_index(); #reset index in order to avoid missing "rows"
df_housing.drop(columns=['index'], inplace=True, axis=1)
df_housing.info();
# Check on min values
df_housing[df_housing['Landsize']<50]; #we have more than 2000 obs, but the most have '0' as value of Landsize.
df_housing[(df_housing['Landsize']<50) & (df_housing['Landsize']>0) &(df_housing["BuildingArea"] > 50)]
The houses with zero Landsize caught my attention, and after some research (on Kaggle and Google), we found out that these are probably the so-called "zero-lot-line" houses: residential real estate in which the structure comes up to, or very near, the edge of the property line (https://www.investopedia.com/terms/z/zero-lot-line-house.asp, https://www.kaggle.com/stephaniestallworth/melbourne-housing-market-eda-and-regression).
Therefore, these observations are valid and will remain in the dataset.
Moreover, we will create a new feature (in the Feature Engineering part) that takes both Landsize and BuildingArea into account.
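A minimal sketch of such a combined-area feature (the column name TotalArea and the toy frame are hypothetical; the real computation would use df_housing's Landsize and BuildingArea columns):

```python
import pandas as pd

# Toy frame including a zero-lot-line house (Landsize == 0)
toy = pd.DataFrame({
    'Landsize':     [0.0, 400.0, 250.0],
    'BuildingArea': [120.0, 150.0, 100.0],
})

# Hypothetical combined feature: for zero-lot-line houses the building
# area alone carries the size information, so a plain sum still works
toy['TotalArea'] = toy['Landsize'] + toy['BuildingArea']
print(toy['TotalArea'].tolist())
```

A simple sum keeps zero-lot-line houses comparable with the rest, since their zero Landsize no longer makes them look like missing data.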
# Check on "unexpected" values: Landsize less than BArea
df_housing[(df_housing['Landsize']<df_housing['BuildingArea']) & (df_housing['Landsize']>0) ]
# Replace unexpected values with the Landsize value
df_housing_new = df_housing.copy(); #copy of the df as a safety net
df_housing['BuildingArea_Replace'] = np.where( ( (df_housing['Landsize']<df_housing['BuildingArea']) & (df_housing['Landsize']>0) ) , df_housing['Landsize'], df_housing['BuildingArea'])
# check
df_housing[(df_housing['Landsize']<df_housing['BuildingArea_Replace']) & (df_housing['Landsize']>0) ] #null list, ok!
# select those ones with "missing" Price (then delete these obs)
df_housing[(df_housing['Landsize']<df_housing['BuildingArea']) & (df_housing['Landsize']>0) & (pd.isna(df_housing['Price']))]
# previous 112 obs that we proceed to delete
df_housing.drop(df_housing[(df_housing['Landsize']<df_housing['BuildingArea']) & (df_housing['Landsize']>0) & (pd.isna(df_housing['Price']))].index, inplace=True)
df_housing = df_housing.reset_index(); #reset index in order to avoid missing "rows"
df_housing.drop(columns=['index'], inplace=True, axis=1) #drop column index
df_housing.info(); #34732 (34844-112) #of obs remaining
# Investigate in BArea (max and min values)
df_housing['BuildingArea'].describe()
# Investigate in BArea (max and min values)
df_housing[df_housing['BuildingArea']>2160] #values > than mean+5*stdev
OK, we had already fixed these "high values" by adding the BuildingArea_Replace column. Nevertheless, we have a suspected outlier with 44500 on BuildingArea_Replace that might be reduced a little.
df_housing['BuildingArea'].sort_values(ascending = False); #sort desc values before BuildingArea_Replace
df_housing['BuildingArea_Replace'].sort_values(ascending = False);
The (former) top-5 observations with the highest BuildingArea values were replaced with more ordinary values, so we can modify the value of the remaining one from 44500 (the BuildingArea_Replace value) to 4451.5 (the original BuildingArea value divided by 10).
# change value of BArea/BArea_Replace
df_housing['BuildingArea'].replace(44515.0, 4451.5, inplace=True)
df_housing['BuildingArea_Replace'].replace(44500.0, 4451.5, inplace=True)
#check
df_housing[df_housing['BuildingArea']>2160]; #ok check
# Min values of BArea
df_housing[df_housing['BuildingArea']<10]; #we obtain 149 houses
df_housing[df_housing['BuildingArea']<1]; #77
It seems that 149 houses have a very small surface (and 77 of these have no building area at all!).
One possibility is that these houses were sold only for their landsize and geographical position, with no house yet built on the land. However, the majority of these have a value in YearBuilt, so there might be some incoherence here. Another possibility is that these are simply data entry mistakes (again?).
In conclusion, I think there is something wrong with most of this subsample, so I will drop at least the ones with a zero building area.
# Drop these 77 obs (no Build_Area)
df_housing.drop(df_housing[(df_housing['BuildingArea']<1) ].index, inplace=True)
df_housing = df_housing.reset_index();
df_housing.info(); #check (34655 obs)
df_housing.drop(columns=['index'], inplace=True, axis=1);
df_housing[(df_housing['BuildingArea']<1)]; #second check
# Check on wrong Data (YearBuilt variable)
df_housing.loc[df_housing.YearBuilt>2018]
Let's replace these 2 mistakes with a (possibly) correct YearBuilt: 2106 should be 2016, while 2019 could be 2009, 2016 or even 2018 (even if the house was sold in early 2018). Moreover, we can see that the Suburb is Bentleigh, where the YearBuilt of houses ranges from 1910 to 2018. Let's have a look at Bentleigh's mode values:
df_bentleigh = df_housing.loc[df_housing.Suburb == 'Bentleigh'];#subDF of 319 obs (ok)
df_bentleigh.YearBuilt.value_counts(); #mode value of YearBuilt
# Mode value is 1950, followed by 1960, 2012 and 1940..
# 2016 has 2 obs, 2018 has none, 2009 has one and 2006 has two.
# So we can replace the wrong values with a proper one (choosing between 2006, 2009, 2016, 2018)
## replace these 2 mistakes with a (possible) correct YearBuilt:
df_housing['YearBuilt'].replace([2106,2019],[2016,2016], inplace=True)
## re-check
df_housing.loc[df_housing.YearBuilt>2018] #check (ok)
Let's re-check how many missing values each variable has in the housing dataframe.
df_housing.info()
# First replace (with coercion) all the string values ('Missing') to mark them as missing values and then convert it to float.
df_housing['Bathroom'] = pd.to_numeric(df_housing.Bathroom, errors='coerce')
df_housing['Car'] = pd.to_numeric(df_housing.Car, errors='coerce')
df_housing.isnull().sum().sort_values(ascending=False)
As already seen, we have a lot of missing values, especially in BuildingArea, YearBuilt and Landsize.
Remember that we have both categorical and numerical features, and some of these are misclassified.
Let's begin with the features that have very few missing values.
df_housing.info()
df_housing[(pd.isna(df_housing['Postcode']))]
df_housing[(pd.isna(df_housing['CouncilArea']))] #same as NAs for PropertyCount and Regionname
We can delete these observations, since they do not bring information to our dataset (most of their features are missing).
df_housing.drop(df_housing[(pd.isna(df_housing['CouncilArea']))].index, inplace=True)
df_housing = df_housing.reset_index();
df_housing.drop(columns=['index'], inplace=True, axis=1);
df_housing[(pd.isna(df_housing['CouncilArea']))] ; #check
df_housing.tail() #check
df_housing.isnull().sum().sort_values(ascending=False) #check
df_housing[(pd.isna(df_housing['Price'])&(pd.isna(df_housing['Longtitude'])) &(pd.isna(df_housing['Lattitude'])))]
As shown above, we have more than 1700 observations that lack Price, Longtitude and Lattitude. So we proceed to try to fix them, starting the imputation from the variables Lattitude and Longtitude.
df_housing['Lattitude'].groupby(df_housing['Regionname']).count()
df_housing['Lattitude'].groupby(df_housing['Suburb']).count()
For a more accurate imputation, we decide to impute missing Longtitude and Lattitude values with their Suburb's average values instead of using Region-level values.
# group by Subs (mean value) in order to impute on LAt/Long values
df_latitude_null=df_housing.Lattitude[df_housing['Lattitude'].isnull()]
print(df_latitude_null.head())
aa = df_housing['Lattitude'].groupby(df_housing['Suburb']).mean()
aa.dropna();
aa1 = pd.DataFrame(aa)
aa1.index
aa1.reset_index(level=0,inplace=True)
Create a merged dataframe, then impute Latitude missing values
df_housing = pd.merge(df_housing,aa1,how='left',left_on = 'Suburb',right_on = 'Suburb')
for i,v in enumerate(df_housing.Lattitude_x):
    if np.isnan(df_housing.iloc[i]['Lattitude_x']):
        df_housing.iloc[i,16]=df_housing.iloc[i,21]
pd.set_option('display.max_columns', None) #check after imputation
df_housing.head(20)
df_housing.isnull().sum().sort_values(ascending=False) #re check null values
# I have to keep Lattitude_x (so, Lat_y can be dropped)
Now imputation for Longitude
# group by Subs (mean value) in order to impute on Long values
bb = df_housing['Longtitude'].groupby(df_housing['Suburb']).mean()
bb.head();
bb.dropna();
bb1 = pd.DataFrame(bb) #transform into a df
bb1.index; #check on indexes, then reset
bb1.reset_index(level=0,inplace=True);
Re-merge the df with bb1, then impute the Longitude missing values
df_housing = pd.merge(df_housing,bb1,how='left',left_on = 'Suburb',right_on = 'Suburb');
for i,v in enumerate(df_housing.Longtitude_x):
    if np.isnan(df_housing.iloc[i]['Longtitude_x']):
        df_housing.iloc[i,17]=df_housing.iloc[i,22] # positional indices are 17 and 22 here because Lattitude_y has not been dropped yet
pd.set_option('display.max_columns', None);
df_housing.head(20) #check
df_housing.isnull().sum().sort_values(ascending=False) #re check null values
We have reduced the missing values of Lattitude and Longtitude from about 8 thousand to fewer than 90. The remaining observations with missing values can be deleted straight away.
#Drop the "duplicate" columns (lat and long), then delete the remaining NAs rows
df_housing.drop(columns=['Lattitude_y','Longtitude_y'], inplace=True, axis=1) #drop columns
df_housing.drop(df_housing[(pd.isna(df_housing['Lattitude_x']))].index, inplace=True) #drop the obs without values on Lat (and at the same time they do not have for Longitude too)
df_housing = df_housing.reset_index();
df_housing.drop(columns=['index'], inplace=True, axis=1);
df_housing[(pd.isna(df_housing['Lattitude_x']))] #check
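The merge-and-positional-iloc imputation above can also be expressed more compactly with `groupby(...).transform('mean')` plus `fillna`, which avoids the fragile column-position indexing; a sketch on hypothetical toy data (on the real data the call would use `df_housing.groupby('Suburb')['Lattitude'].transform('mean')`):

```python
import pandas as pd

# Toy frame with one missing latitude per suburb group
toy = pd.DataFrame({
    'Suburb':    ['Abbotsford', 'Abbotsford', 'Abbotsford', 'Richmond', 'Richmond'],
    'Lattitude': [-37.80, -37.81, None, -37.82, None],
})

# transform('mean') broadcasts each suburb's mean latitude back to row
# level, and fillna only touches the rows that were actually missing
suburb_mean = toy.groupby('Suburb')['Lattitude'].transform('mean')
toy['Lattitude'] = toy['Lattitude'].fillna(suburb_mean)
print(toy)
```

Besides being shorter, this version keeps the frame's shape and column names untouched, so no duplicate "_x"/"_y" columns need to be dropped afterwards.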
Regarding Bathroom and Car, I think the simplest way to deal with missing values is to substitute them with their modal value.
## Imputation of bathroom and car with their mode value (bath =1, car =2)
df_housing_copy = df_housing.copy() #just a copy of the df
# (sklearn's old Imputer API has since been removed; SimpleImputer is the current equivalent)
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='most_frequent')
df_housing['Bathroom'] = imp.fit_transform(df_housing[['Bathroom']]).ravel()
df_housing['Car'] = imp.fit_transform(df_housing[['Car']]).ravel()
df_housing.head(50) #check
df_housing.isnull().sum().sort_values(ascending=False)
So now we have to impute values for YearBuilt, Landsize and BuildingArea, and finally Price.
For YearBuilt we can base our imputation on the type of house, looking at the median value for each type (the median is more robust than the mean).
cc = df_housing['YearBuilt'].groupby(df_housing['Type']).median()
cc1 = pd.DataFrame(cc) #transform into a df
cc1.index; #check on indexes, then reset
cc1.reset_index(level=0,inplace=True);
df_housing = pd.merge(df_housing,cc1,how='left',left_on = 'Type',right_on = 'Type');
df_housing.head()
# imputation on YearBuilt
for i,v in enumerate(df_housing.YearBuilt_x):
    if np.isnan(df_housing.iloc[i]['YearBuilt_x']):
        df_housing.iloc[i,14]=df_housing.iloc[i,21]
df_housing.head()
df_housing.isnull().sum().sort_values(ascending=False) #check
For BuildingArea and Landsize we try to exploit a relationship with Rooms; in fact, we expect area and land to increase with the number of rooms.
Since there are a lot of missing values in the area features, we will probably opt for the mean instead of the median value. Let's see..
df_housing['Rooms_v2'] = np.where(df_housing['Rooms'] > 5, 6, df_housing['Rooms']) #create a col rooms ranging from 1 to 5+.
df_housing[df_housing['Rooms']>5].head(); #check
dd = df_housing['Landsize'].groupby(df_housing['Rooms_v2']).mean() #With mean we obtain:
dd
ddd = df_housing['Landsize'].groupby(df_housing['Rooms_v2']).median() #With median:
ddd
Using the mean seems better in this case. Moreover, remember that Landsize contains a lot of "zero-lot" houses, as already mentioned.
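The mean/median comparison done in the two separate groupbys above can also be produced in one call with `.agg`; a sketch on hypothetical toy data (the real call would be `df_housing.groupby('Rooms_v2')['Landsize'].agg(['mean', 'median'])`):

```python
import pandas as pd

# Toy stand-in: a few landsizes per room band, including zero-lot houses
toy = pd.DataFrame({
    'Rooms_v2': [1, 1, 2, 2, 2],
    'Landsize': [0.0, 200.0, 300.0, 0.0, 700.0],
})

# One call yields both statistics side by side; with many zero-lot
# houses in a band, the median sits below the mean
stats = toy.groupby('Rooms_v2')['Landsize'].agg(['mean', 'median'])
print(stats)
```

Seeing both columns side by side makes the mean-vs-median choice discussed above easier to justify per room band.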
dd1 = pd.DataFrame(dd) #transform into a df
dd1.index; #check on indexes, then reset
dd1.reset_index(level=0,inplace=True);
ee = df_housing['BuildingArea_Replace'].groupby(df_housing['Rooms_v2']).mean()
ee
eee = df_housing['BuildingArea_Replace'].groupby(df_housing['Rooms_v2']).median()
eee
Values of mean and median are quite similar; we choose the median in this case.
ee1 = pd.DataFrame(eee) #transform into a df
ee1.index; #check on indexes, then reset
ee1.reset_index(level=0,inplace=True);
df_housing = pd.merge(df_housing,dd1,how='left',left_on = 'Rooms_v2',right_on = 'Rooms_v2'); #Landsize
for i,v in enumerate(df_housing.Landsize_x):
    if np.isnan(df_housing.iloc[i]['Landsize_x']):
        df_housing.iloc[i,12]=df_housing.iloc[i,23]
df_housing = pd.merge(df_housing,ee1,how='left',left_on = 'Rooms_v2',right_on = 'Rooms_v2'); #BArea
for i,v in enumerate(df_housing.BuildingArea_Replace_x):
    if np.isnan(df_housing.iloc[i]['BuildingArea_Replace_x']):
        df_housing.iloc[i,20]=df_housing.iloc[i,24]
df_housing.head() #check after imputation Landsize+BArea
#Drop the duplicated columns produced by the merges, then reset the index
df_housing.drop(columns=['BuildingArea','BuildingArea_Replace_y','Landsize_y','YearBuilt_y'], inplace=True, axis=1) #drop columns
df_housing = df_housing.reset_index();
df_housing.drop(columns=['index'], inplace=True, axis=1);
# ## Export the dataset with the imputation "completed" (a checkpoint: the for loops are slow, so we can restart from this point)
# df_housing.to_csv("Melbourne_after_imp.csv")
# ## Reload the data (from the imputed df) + re-run the initial section with all the packages if needed
#df_housing = pd.read_csv('../Coursework/Melbourne_after_imp.csv')
df_housing.isnull().sum().sort_values(ascending=False) #check
df_housing.info()
So, we have imputed all of the independent variables. Now we can delete the observations with missing Price, since it is our dependent variable.
df_housing_final = df_housing.copy()
df_housing_final.drop(df_housing_final[(pd.isna(df_housing_final['Price']))].index, inplace=True)
df_housing_final = df_housing_final.reset_index();
df_housing_final.drop(columns=['index'], inplace=True, axis=1);
df_housing_final[(pd.isna(df_housing_final['Price']))] #check
df_housing_final.info()
The dataset contains the year each home was built. Although this is measured as a specific year, what the variable really probes is the age of the home.
Thus, home age can be expressed as contemporary (built in recent years), modern (built between 1930 and 1970) or historic (more than 87 years old) to get to the heart of this information in a more summarised way, allowing for better analysis and visualisation.
Moreover, we can create new categories from the variable Date in order to better understand the changes in house market prices.
Finally, we might create an additional "Total Area" feature as the sum of Landsize and BuildingArea_Replace.
# first, create "Age" variable
df_housing_final["Age"]= 2018 - df_housing_final["YearBuilt_x"]
# Identify and divide houses into separate "time class"
df_housing_final["House_Class"] = np.where(df_housing_final["Age"] > 87, 'Historic',
                                           np.where(df_housing_final["Age"] < 48, 'Contemporary', 'Modern'))
# Convert this House_Class in a cat variable
df_housing_final["House_Class"] = df_housing_final["House_Class"].astype('category')
# From Date, calculate year, month, day of the year, weekday and also Season
df_housing_final["Date_x"] = pd.to_datetime(df_housing_final.Date) #first convert Date to datetime in a new feature
df_housing_final['Year']=df_housing_final.Date_x.dt.year
df_housing_final['Month']=df_housing_final.Date_x.dt.month
df_housing_final['Day']=df_housing_final.Date_x.dt.dayofyear
df_housing_final['Weekday']=df_housing_final.Date_x.dt.dayofweek
# divide into seasons (reversed in the southern hemisphere)
fall = range(80,172)
winter = range(172, 264)
spring = range(264, 355)
# summer = everything else
seas =[]
for i in df_housing_final['Day']:
    if i in fall:
        Season = 'fall'
    elif i in winter:
        Season = 'winter'
    elif i in spring:
        Season = 'spring'
    else:
        Season = 'summer'
    seas.append(Season)
# Add the resulting column to the df
df_housing_final['Season']= pd.Series(seas)
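The season loop above can also be vectorised with `np.select` over the same day-of-year boundaries; a small sketch on sample values:

```python
import numpy as np
import pandas as pd

day = pd.Series([10, 100, 200, 300, 360])  # sample day-of-year values

# Same southern-hemisphere boundaries as the loop above
conditions = [
    day.between(80, 171),   # fall
    day.between(172, 263),  # winter
    day.between(264, 354),  # spring
]
season = np.select(conditions, ["fall", "winter", "spring"], default="summer")
```

On the full dataframe this would be `np.select` applied to `df_housing_final['Day']`, assigned straight into the `Season` column.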
# Create "Total Area"
df_housing_final["Total_Area"] = df_housing_final["BuildingArea_Replace_x"]+df_housing_final["Landsize_x"]
df_housing_final.head()
In this section, we will explore the "new" dataset with the features just created, trying to gain insights through multiple analyses, including a time series and a cluster analysis.
df_housing_final["Total_Area"].describe()
bins=[0,450,700,950]
df_housing_final['Total_Area_bins']=np.digitize(df_housing_final.Total_Area,bins)
df_housing_final.Total_Area_bins.value_counts()
# Create a pivot table (median price) for each combination of features:
df_prices=df_housing_final.groupby(['Suburb','Year','Type','Rooms_v2','Total_Area_bins' ],as_index=False).Price.median()
df_prices.columns=['Suburb','Year','Type','Rooms_v2','Total_Area_bins','Median_price']
df_prices.head()
df_housing_final=df_housing_final.merge(df_prices,on=['Suburb','Year','Type','Rooms_v2','Total_Area_bins'])
df_housing_final['Sold_Above_Zone_Median']=0
df_housing_final.loc[df_housing_final.Price>df_housing_final.Median_price,'Sold_Above_Zone_Median']=1
df_housing_final.head()
# Price distribution for each Total_Area_bins
from pylab import rcParams
rcParams['figure.figsize'] = 10, 4
small_area=df_housing_final[df_housing_final.Total_Area_bins==1]
avg_area=df_housing_final[df_housing_final.Total_Area_bins==2]
big_area=df_housing_final[df_housing_final.Total_Area_bins==3]
out_area=df_housing_final[df_housing_final.Total_Area_bins==4]
fig1=sns.kdeplot(small_area.Price,color='red',shade=True)
fig2=sns.kdeplot(avg_area.Price,color='orange',shade=True)
fig3=sns.kdeplot(big_area.Price,color='green',shade=True)
fig4=sns.kdeplot(out_area.Price,color='blue',shade=True)
plt.legend(['small area','avg area','big area','outstanding area'])
plt.show()
From this plot it is clear that houses with a small or medium area show less price variability than those with a big or outstanding "size" (land size plus building area).
Let's look at houses and villas with 2 bedrooms and how their prices changed from year to year and from one suburb to another.
df_2rooms_h=df_housing_final[(df_housing_final.Type=='h')&(df_housing_final.Rooms==2)]
df_group=df_2rooms_h.groupby(['Suburb','Year'],as_index=False).Price.median()
pv=df_group.pivot_table(index='Suburb',columns='Year',values='Price')
pv['Trend']='unknown'
pv.loc[(pv[2017]>pv[2016]) & (pv[2018]>pv[2017]),'Trend']='up-up'
pv.loc[(pv[2017]>pv[2016]) & (pv[2018]<pv[2017]),'Trend']='up-down'
pv.loc[(pv[2017]<pv[2016]) & (pv[2018]>pv[2017]),'Trend']='down-up'
pv.loc[(pv[2017]<pv[2016]) & (pv[2018]<pv[2017]),'Trend']='down-down'
pv.loc[(pv[2017]>pv[2016]) & (pd.isna(pv[2018])),'Trend']='up-na'
pv.loc[(pv[2017]<pv[2016]) & (pd.isna(pv[2018])),'Trend']='down-na'
pv.loc[(pv[2018]>pv[2017]) & (pd.isna(pv[2016])),'Trend']='na-up'
pv.loc[(pv[2018]<pv[2017]) & (pd.isna(pv[2016])),'Trend']='na-down'
pv.loc[(pv[2018]>pv[2016]) & (pd.isna(pv[2017])),'Trend']='up'
pv.loc[(pv[2018]<pv[2016]) & (pd.isna(pv[2017])),'Trend']='down'
pv.head()
df_2rooms_h=df_2rooms_h.merge(pv,left_on='Suburb',right_index=True)
df_1_h=df_2rooms_h[df_2rooms_h.Total_Area_bins<3]
df_2rooms_h.head()
df_group=pd.crosstab(df_1_h.Suburb,df_1_h.Sold_Above_Zone_Median).apply(lambda x:x*100/sum(x),axis=1).merge(df_1_h[['Suburb','Trend']],left_index=True,right_on='Suburb')
df_group=df_group.loc[:,[0,1,'Trend','Suburb']].drop_duplicates() #'0' is %sold_below_median, '1' %above_median
df_group[(df_group["Trend"]=='up-up')]
Although median values rose from 2016 to 2018, Princess Hill is the suburb offering good "price" opportunities for a nice 2-bedroom house or villa, followed by the suburb of Kingsville.
df_group[(df_group["Trend"]=='down-down')]
Looking at the suburbs with a downward trend over the last two years, we can spot Brooklyn as the suburb where 2-bedroom houses, villas and cottages are consistently sold below the median. A quick investigation suggests a plausible explanation:
"Brooklyn is significantly notorious for industrial pollution and sewer issues due to the high clay content in the soil causing trouble with the foundation. In recent years EPA Victoria has received numerous complaints about offensive smelling industrial pollution from Brooklyn by residents of nearby suburbs." (Wikipedia)
Looking at the big picture and considering the size of the dataset, we expect to have house sales spread all over these years (months and days). Let's see:
# Count the number of rows of the df grouped by day using a pivot table.
df_housing_final.pivot_table('Price', index='Date_x', aggfunc='sum').shape
df_housing_final.head()
# From the shape above we learn (surprisingly!) that in more than 2 years there were only 78 distinct "sale" dates
import calendar
sales_by_month= df_housing_final.groupby('Month')['Price'].count()
#plot figure (by month)
plt.figure(figsize=(12,8))
plt.plot(sales_by_month, color="red")
plt.xlabel('Month')
plt.suptitle('Sold houses by months')
plt.ylabel('# Houses')
plt.xticks(np.arange(1, 13), calendar.month_name[1:13], rotation=20)
plt.show()
#plot figure (by weekdays)
sales_by_week= df_housing_final.groupby('Weekday')['Price'].count()
plt.figure(figsize=(12,8))
plt.plot(sales_by_week, color="blue")
plt.xlabel('Weekdays')
plt.suptitle('Sold houses by weekdays')
plt.ylabel('# Houses')
plt.xticks(np.arange(7), calendar.day_name[0:7], rotation=20)
plt.show()
#plot figure (by season)
sales_by_season= df_housing_final.groupby('Season')['Price'].count()
plt.figure(figsize=(12,8))
plt.plot(sales_by_season, color="green")
plt.xlabel('Season')
plt.suptitle('Sold houses by Season')
plt.ylabel('# Houses')
plt.xticks(rotation=20)
plt.show()
# Create index month-year
df_housing_final['Month_Year'] = df_housing_final['Date_x'].dt.to_period('M')
sales_by_month_year_v2= pd.Series.to_frame(df_housing_final.groupby('Month_Year')['Price'].count())
#plot figure(month-year)
fig, ax = plt.subplots(figsize=(12,8))
sales_by_month_year_v2.plot(ax=ax, xticks=sales_by_month_year_v2.index, rot=45)
ax.set_xticklabels(sales_by_month_year_v2.index)
plt.show()
import warnings
warnings.filterwarnings("ignore")  # silence library warnings for the plots below
# Suplots of Seasonal and yearly trend & price
sns.set_style('darkgrid')
f, axes = plt.subplots(2,2, figsize = (14,14))
# Plot [0,0]
sns.boxplot(data = df_housing_final, x = 'Season', y = 'Price', ax = axes[0, 0])
axes[0,0].set_xlabel('Season')
axes[0,0].set_ylabel('Price')
axes[0,0].set_title('Season & Price')
# Plot [0,1]
sns.violinplot(data = df_housing_final, x = 'Year', y = 'Price',ax = axes[0, 1])
axes[0,1].set_xlabel('Year')
axes[0,1].set_ylabel('Price')
axes[0,1].set_title('Year & Price')
# Plot [1,0]
sns.countplot(data = df_housing_final, x = 'Season', ax = axes[1,0])
axes[1,0].set_xlabel('Season')
axes[1,0].set_ylabel('Sold houses')
axes[1,0].set_title('Houses sold per Season')
# Plot [1,1]
sns.countplot(data = df_housing_final,x = 'Year', ax = axes[1,1])
axes[1,1].set_xlabel('Year')
axes[1,1].set_ylabel('Sold houses')
axes[1,1].set_title('Houses sold per Year')
Main insights:
Regarding season, there is no big difference in price between houses sold in different parts of the year (boxplot "Season & Price"). Spring is the season with the most houses sold (peaking in November), while summer is the worst.
2017 was by far the best year for the market, with apparently more high-price outliers and double the number of houses sold compared to the previous year. 2018 of course shows less activity, due to data collection (only January-March are available).
Most houses are sold on Saturdays and Fridays.
#!pip install plotly
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)
# Establish the sum (of sales) for each month and year
def month_year_sales(df, month, year):
    # total Price of houses sold in the given month and year
    mask = (df['Month'] == month) & (df['Year'] == year)
    return df.loc[mask, 'Price'].sum()
labels = ['January', 'February', 'March', 'April',
'May', 'June', 'July', 'August', 'September',
'October', 'November', 'December']
colors = ['#ffb4da', '#b4b4ff', '#daffb4', '#fbab60', '#fa8072', '#FA6006',
'#FDB603', '#639702', '#dacde6', '#faec72', '#9ab973', '#87cefa']
# Monthly sales totals for each year (2018 runs only through March)
lst_2016 = [month_year_sales(df_housing_final, m, 2016) for m in range(1, 13)]
lst_2017 = [month_year_sales(df_housing_final, m, 2017) for m in range(1, 13)]
lst_2018 = [month_year_sales(df_housing_final, m, 2018) for m in range(1, 4)]
plot_2016 = go.Scatter(
x=lst_2016,
y=labels,
xaxis='x2',
yaxis='y2',
mode='markers',
name='2016',
marker=dict(
color='rgba(0, 128, 255, 0.95)',
line=dict(
color='rgba(56, 56, 56, 1)',
width=1.5,
),
symbol='circle',
size=16,
)
)
plot_2017 = go.Scatter(
x=lst_2017,
y=labels,
xaxis='x2',
yaxis='y2',
mode='markers',
name='2017',
marker=dict(
color='rgba(255, 72, 72, 0.95)',
line=dict(
color='rgba(56, 56, 56, 1)',
width=1.5,
),
symbol='circle',
size=16,
)
)
plot_2018 = go.Scatter(
x=lst_2018,
y=labels,
xaxis='x2',
yaxis='y2',
mode='markers',
name='2018',
marker=dict(
color='rgba(72, 255, 72, 0.95)',
line=dict(
color='rgba(56, 56, 56, 1)',
width=1.5,
),
symbol='circle',
size=16,
)
)
data = [plot_2016, plot_2017, plot_2018]
layout = go.Layout(title="Sales by Month for the Years <br> (2016, 2017, 2018)",
xaxis=dict(title='Sum of Sales by Month'),
yaxis=dict(title='Month'))
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='multiple-subplots')
This interactive scatterplot confirms that "spring" is the most productive season in terms of houses sold and sale prices.
October 2017 was the best single month in these terms, while November has the highest total over the two-year period (more than $3.6 billion).
# Visualise percent changes for each of the month during the two year period (total amount spent on market)
# Percent Change formula: (New Value - Old Value) / Old Value
from numpy import inf, log, unique  # log/unique are used in later cells

def percent_change(old_val, new_val):
    # Percent Change formula: (New Value - Old Value) / Old Value, as a rounded %
    return round((new_val - old_val) / old_val * 100, 2)
fst_perchange_lst = []
snd_perchange_lst = []
# For the months between the years 2016-2017
for old, new in zip(lst_2016, lst_2017):
    per_change = ((new - old) / old) * 100
    fst_perchange_lst.append(round(per_change, 2))
per_2016_2017 = np.array(fst_perchange_lst)
per_2016_2017[2] = 100  # patch the degenerate March value (near-zero 2016 total)
for old, new in zip(lst_2017, lst_2018):
    per_change = ((new - old) / old) * 100
    snd_perchange_lst.append(round(per_change, 2))
per_2017_2018 = np.array(snd_perchange_lst)
per_2017_2018 = np.concatenate([per_2017_2018, np.zeros(9)])  # pad: 2018 covers only Jan-Mar
per_2017_2018[0] = 100  # patch the degenerate January value
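The month-by-month loops above could also be condensed into a single DataFrame operation once the monthly totals sit in a year-by-month table; a sketch with toy numbers standing in for the notebook's sums:

```python
import pandas as pd

# Toy monthly totals: rows = month, columns = year
monthly = pd.DataFrame(
    {2016: [100.0, 200.0], 2017: [150.0, 100.0]},
    index=["January", "February"],
)

# Year-over-year percent change for every month at once
pct_2016_2017 = ((monthly[2017] - monthly[2016]) / monthly[2016] * 100).round(2)
```

In the notebook, such a table would come straight from `df_housing_final.pivot_table('Price', index='Month', columns='Year', aggfunc='sum')`.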
# Here we should create a line plot for Percent change.
trace0 = go.Scatter(
x = labels,
y = per_2016_2017,
name = 'Percent Change (2016-2017)',
text = '%',
line = dict(
color = ('rgb(220, 10, 20)'),
width = 4,
dash = 'dash'
)
)
trace1 = go.Scatter(
x = labels,
y = per_2017_2018,
name = 'Percent Change (2017-2018)',
text = '%',
line = dict(
color = ('rgb(0, 20, 220)'),
width = 4,
dash = 'dash'
)
)
data = [trace0, trace1]
layout = dict(title = 'Percent Change in House Sales',
xaxis = dict(title = 'Month'),
yaxis = dict(title = 'Percent Change (%)'),
paper_bgcolor='rgb(255, 255, 255)'
)
fig = dict(data=data, layout=layout)
iplot(fig, filename='styled-line')
# Establish the avg (price) for each month and year
def avg_month_year_sales(df, month, year):
    # average Price of houses sold in the given month and year
    mask = (df['Month'] == month) & (df['Year'] == year)
    return df.loc[mask, 'Price'].mean()
# Monthly average prices for each year (2018 runs only through March)
avg_lst_2016 = [avg_month_year_sales(df_housing_final, m, 2016) for m in range(1, 13)]
avg_lst_2017 = [avg_month_year_sales(df_housing_final, m, 2017) for m in range(1, 13)]
avg_lst_2018 = [avg_month_year_sales(df_housing_final, m, 2018) for m in range(1, 4)]
#def perchange list
avg_fst_perchange_lst = []
avg_snd_perchange_lst = []
# For the months between the years 2016-2017
for old, new in zip(avg_lst_2016, avg_lst_2017):
    per_change = ((new - old) / old) * 100
    avg_fst_perchange_lst.append(round(per_change, 2))
avg_per_2016_2017 = np.array(avg_fst_perchange_lst)
avg_per_2016_2017[2] = 100  # patch the degenerate March value
for old, new in zip(avg_lst_2017, avg_lst_2018):
    per_change = ((new - old) / old) * 100
    avg_snd_perchange_lst.append(round(per_change, 2))
avg_per_2017_2018 = np.array(avg_snd_perchange_lst)
avg_per_2017_2018 = np.concatenate([avg_per_2017_2018, np.zeros(9)])  # pad: only the first 3 months of 2018
avg_per_2017_2018[0] = 100  # patch the degenerate January value
# Here we should create a line plot for Percent change.
trace0 = go.Scatter(
x = labels,
y = avg_per_2016_2017,
name = 'Percent Change (2016-2017)',
text = '%',
line = dict(
color = ('rgb(220, 10, 20)'),
width = 4,
dash = 'dash'
)
)
trace1 = go.Scatter(
x = labels,
y = avg_per_2017_2018,
name = 'Percent Change (2017-2018)',
text = '%',
line = dict(
color = ('rgb(0, 20, 220)'),
width = 4,
dash = 'dash'
)
)
data = [trace0, trace1]
layout = dict(title = 'Percent Change in Average Price',
xaxis = dict(title = 'Month'),
yaxis = dict(title = 'Percent Change (%)'),
paper_bgcolor='rgb(255, 255, 255)'
)
fig = dict(data=data, layout=layout)
iplot(fig, filename='styled-line')
These final trend plots summarise, month by month, the change in total amount spent on the housing market and the % change in average price relative to the previous year. February is without doubt the most striking month: it peaked in % of total amount spent from 2016 to 2017 (+1754%), with a +23% gain in average sale price; from 2017 to 2018 it registered a +153% increase in total spend but, at the same time, a critical drop (about -11%) in average price.
In this subsection we apply the K-means clustering algorithm to the price distribution.
#make a copy for cluster analysis
df_housing_cluster = df_housing_final.copy()
# import plotly.plotly as py
# import plotly.graph_objs as go
# from plotly import tools
from sklearn.cluster import KMeans
# plot the Price distribution "normal" and with a log-transf
plt.figure(figsize=(14, 10))  # one figure for the two distribution subplots
plt.subplot(221)
sns.distplot(df_housing_cluster.Price, hist=True)
plt.title('Price distribution')
plt.subplot(222)
sns.distplot(log(df_housing_cluster.Price), hist=True)
plt.title('Log-Price distribution')
df_housing_cluster.info()
df_housing_cluster["Price_log"] = log(df_housing_cluster.Price)
df_housing_cluster["Price_log"].describe()
# define bins for (log)Price
bins_pclust=[11,12.5,13,13.5,14,16.5]
df_housing_cluster['Price_log_bins']=np.digitize(df_housing_cluster.Price_log,bins_pclust)
df_housing_cluster.Price_log_bins.value_counts()
# Preprocess Rooms, Bathrooms and Car
L = len(unique(df_housing_cluster.Rooms_v2))
df_housing_cluster["Rooms_v2_clust"] = (df_housing_cluster.Rooms_v2 -0.5)/L
L2 = len(unique(df_housing_cluster.Bathroom))
df_housing_cluster["Bath_clust"] = (df_housing_cluster.Bathroom -0.5)/L2
L3 = len(unique(df_housing_cluster.Car))
df_housing_cluster["Car_clust"] = (df_housing_cluster.Car - 0.5)/L3
# insert the 5 features in a separate df
prova = df_housing_cluster[["Rooms_v2_clust","Bath_clust","Car_clust","Price_log_bins","Distance"]]
prova.head()
# Cluster analysis
## Elbow graph - choosing the "k"
sse = {}
for k in range(1, 10):
    # fit on the five feature columns only, so the "clusters" column never leaks into the fit
    feats = prova[["Rooms_v2_clust", "Bath_clust", "Car_clust", "Price_log_bins", "Distance"]]
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(feats)
    prova["clusters"] = kmeans.labels_
    sse[k] = kmeans.inertia_  # Inertia: sum of squared distances of samples to their closest cluster centre
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.ylabel("SSE")
plt.title("Cluster Analysis - Elbow Method")
plt.show()
# The elbow, with a significant reduction in SSE, appears at k=3.
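Besides the elbow heuristic, the silhouette score offers a complementary check on the choice of k; a quick sketch on synthetic blobs (not the notebook's features), where two well-separated groups make k=2 score highest:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs in 2D
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher = better-separated clusters

best_k = max(scores, key=scores.get)
```

On the real `prova` frame, a silhouette curve next to the elbow plot would make the k=3 choice easier to defend.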
feats = prova[["Rooms_v2_clust", "Bath_clust", "Car_clust", "Price_log_bins", "Distance"]]
kmeans = KMeans(n_clusters=3, max_iter=1000).fit(feats)
prova["clusters"] = kmeans.labels_
y_kmeans = kmeans.labels_  # labels for colouring the scatterplots below
#Let's plot first with a 2D scatter, then with a 3D plot
# Subplots of 2D Cluster
#plt.scatter(prova.Bath_clust, prova.Distance, c=y_kmeans, s=50, cmap='viridis')
plt.scatter(prova.Car_clust, prova.Distance, c=y_kmeans, s=50, cmap='viridis')
plt.xlabel("Car")
plt.ylabel("Distance")
plt.title("2D Cluster Visual Interaction")
plt.show()
plt.scatter(prova.Price_log_bins, prova.Distance, c=y_kmeans, s=50, cmap='viridis')
plt.xlabel("Price")
plt.ylabel("Distance")
plt.show()
plt.scatter(prova.Rooms_v2_clust, prova.Bath_clust, c=y_kmeans, s=50, cmap='viridis')
plt.xlabel("Rooms")
plt.ylabel("Bathroom")
plt.show()
plt.scatter(prova.Bath_clust, prova.Distance, c=y_kmeans, s=50, cmap='viridis')
plt.xlabel("Bathroom")
plt.ylabel("Distance")
plt.show()
# scatter 3d w plotly
import plotly
from plotly.offline import init_notebook_mode
import plotly.graph_objs as go
plotly.offline.init_notebook_mode(connected=True)
trace1 = go.Scatter3d(
x=prova.Rooms_v2_clust,
y=prova.Price_log_bins,
z=prova.Distance,
mode='markers',
marker=dict(
size=12,
color=prova.Distance,
colorscale='Viridis',
opacity=0.8
)
)
data = [trace1]
layout = go.Layout(title="<br> Cluster 3D Plot (Rooms, Price, Distance)",
margin=dict(
l=0,
r=0,
b=0,
t=0
)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
The cluster analysis shows the huge impact of Distance in separating the groups. Regarding the interaction between Rooms and Bathrooms, we see well-defined clusters when both features are small, but even at high values of Rooms there is one dominant cluster (purple in the 2D scatterplot). Thus, we expect Rooms to carry greater "importance" than Bathrooms in predicting the final price as their values increase. To refine this, we might re-run the cluster analysis with k=4, define 4 bins for Price, and also rescale Distance.
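As suggested, re-running the clustering with k=4 and a min-max-rescaled Distance is straightforward; a sketch on a synthetic stand-in for `prova` (the notebook's real frame would take its place):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for `prova` with three of its features
rng = np.random.default_rng(1)
prova = pd.DataFrame({
    "Rooms_v2_clust": rng.uniform(0, 1, 200),
    "Price_log_bins": rng.integers(1, 5, 200).astype(float),
    "Distance": rng.uniform(0, 48, 200),
})

# Rescale Distance to [0, 1] so it no longer dominates the Euclidean distance
d = prova["Distance"]
prova["Distance"] = (d - d.min()) / (d.max() - d.min())

kmeans4 = KMeans(n_clusters=4, n_init=10, random_state=0).fit(prova)
prova["clusters"] = kmeans4.labels_
```

After rescaling, the grouping should depend far less on raw Distance and more on the interplay of the other features.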
#First plot: House prices by Location
df_housing_final.plot(kind="scatter", x="Longtitude_x", y="Lattitude_x", alpha=0.5,
c=log(df_housing_final.Price), cmap=plt.get_cmap("jet"), label= 'Price by Location', figsize=(14,10))
plt.ylabel("Latitude", fontsize=14)
plt.legend(fontsize=12)
# plot with House_class, but first convert categorical to "codes"
df_housing_final["House_Class_Cat"] = df_housing_final["House_Class"].cat.codes;
# Second plot: House_class by location
df_housing_final.plot(kind="scatter", x="Longtitude_x", y="Lattitude_x", alpha=0.5,
                      c=df_housing_final.House_Class_Cat, cmap=plt.get_cmap("jet"), label='House Age by Location', figsize=(14,10))
plt.ylabel("Latitude", fontsize=14)
plt.legend(fontsize=12)
plt.show()
These maps really emphasise the "growth" and urban evolution of the city of Melbourne. The highest-priced houses, like the oldest ones, are located in the city centre and close to the bay.
This section provides an ML analysis, using regression techniques to predict the price of a house from its predictors.
# Let's recheck our DF
df_housing_final.info()
# So, before splitting into train and test sets we need to choose the features we consider good predictors
# of the final price and encode the categorical ones.
df_housing_final.corr() #Check numerical features with Price
# Dummy variable for the categorical ones
df_ML = pd.get_dummies(df_housing_final, columns= ['Type', 'Regionname', 'House_Class','Season'])
# Let's keep main variables
df_ML = df_ML[["Price","Distance","Bathroom","Car","Landsize_x","Longtitude_x","Lattitude_x","BuildingArea_Replace_x","Rooms_v2","Age","Total_Area_bins",
               "Type_h","Type_t","Type_u","Regionname_Eastern Metropolitan","Regionname_Eastern Victoria",
               "Regionname_Northern Metropolitan","Regionname_Northern Victoria","Regionname_South-Eastern Metropolitan",
               "Regionname_Southern Metropolitan","Regionname_Western Metropolitan","Regionname_Western Victoria",
               "House_Class_Contemporary","House_Class_Historic","House_Class_Modern","Season_fall","Season_spring","Season_summer","Season_winter"]]
#check variables
df_ML.info()
# Proceed with "features scaling" since we have different "value ranges"
from sklearn.preprocessing import RobustScaler
# Assign "Price" to "Y" var and predictors to X variables
X = df_ML.drop("Price", axis=1)
Y = df_ML["Price"].copy()
scaler = RobustScaler()
X = pd.DataFrame(scaler.fit_transform(X.astype(np.float64)), columns=X.columns)  # scale the predictors, keeping the column names
# Ready for predictions. We will compare 2 ML methods: Linear Regressions vs Random Forest
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size = .30, random_state= 101) #split 70-30
# fit Regression
lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)
# fit Random forest
forest = RandomForestRegressor()
forest.fit(x_train, y_train);
# Test on the train data first, evaluating the RMSE and the MAPE (for accuracy)
models= [('Lin_reg', lin_reg), ('Random Forest', forest)]
from sklearn.metrics import mean_squared_error
for i, model in models:
    predictions = model.predict(x_train)
    MSE = mean_squared_error(y_train, predictions)
    RMSE = np.sqrt(MSE)
    msg_err = "%s = %.2f" % (i, round(RMSE, 2))
    errors = abs(predictions - y_train)
    mape = np.mean(100 * (errors / y_train))  # Mean Absolute Percentage Error
    accuracy = 100 - mape
    msg_acc = "%s = %.2f" % (i, round(accuracy, 2))
    print('RMSE of', msg_err, 'Accuracy of', msg_acc, '%')
We expected RF to perform better than linear regression. However, we are predicting on the training data, so there is a risk of overfitting (especially for the RF). To understand the models' performance better and to predict on the test data, we should use an alternative strategy: cross-validation, keeping only RF as the "regression" method for our final prediction.
# for both models we want to test 3 different scoring metrics (RMSE, MAE and R^2) with a 10 folds CV.
# Then we will average and save their results into a df
from sklearn.model_selection import cross_validate
models= [('Lin_reg', lin_reg), ('Random Forest', forest)]
scoring = ['neg_mean_squared_error', 'neg_mean_absolute_error', 'r2']
res= []
for name, model in models:
    for i in scoring:
        scores = cross_validate(model, x_train, y_train, scoring=i, cv=10, return_train_score=True)
        res.append(scores)
# Lin Reg (Average of results)
LR_RMSE_mean = np.sqrt(-res[0]['test_score'].mean())  # sqrt of the mean neg-MSE gives the RMSE
LR_RMSE_std = np.sqrt(-res[0]['test_score']).std()    # std of the per-fold RMSEs
LR_MAE_mean = -res[1]['test_score'].mean()
LR_MAE_std= res[1]['test_score'].std()
LR_R2_mean = res[2]['test_score'].mean()
LR_R2_std = res[2]['test_score'].std()
# Random Forest's averages
RF_RMSE_mean = np.sqrt(-res[3]['test_score'].mean())
RF_RMSE_std = np.sqrt(-res[3]['test_score']).std()
RF_MAE_mean = -res[4]['test_score'].mean()
RF_MAE_std= res[4]['test_score'].std()
RF_R2_mean = res[5]['test_score'].mean()
RF_R2_std = res[5]['test_score'].std()
# storing results in a DF
Model_DF = pd.DataFrame({
'Model' : ['Linear Regression', 'Random Forest'],
'RMSE_mean' : [LR_RMSE_mean, RF_RMSE_mean],
'RMSE_std' : [LR_RMSE_std, RF_RMSE_std],
'MAE_mean' : [LR_MAE_mean, RF_MAE_mean],
'MAE_std' : [LR_MAE_std, RF_MAE_std],
'R2_mean' : [LR_R2_mean, RF_R2_mean],
'R2_std' : [LR_R2_std, RF_R2_std],
}, columns = ['Model', 'RMSE_mean', 'RMSE_std', 'MAE_mean', 'MAE_std', 'R2_mean', 'R2_std'])
Model_DF.sort_values(by='R2_mean', ascending=False)
Considering these results, we are quite impressed by RF's average R$^2$, but in terms of errors we expected a larger gap between RF and linear regression. Let's proceed to fine-tune our RF model.
# Visualise linear regression coefficient in order to have an idea of how features impact the final predicted Price
coeff_df = pd.DataFrame(lin_reg.coef_,X.columns,columns=['Coefficient'])
ranked_features_reg = coeff_df.sort_values("Coefficient", ascending = False)
ranked_features_reg
Explaining briefly these coefficients:
For the Random Forest we try to fine-tune the model with two different methodologies: Grid Search and Randomized Search, both exploring the same space of parameters. The resulting parameter settings are quite similar, while the run time of the randomized search is drastically lower. This is mainly because Grid Search tries every combination of the hyperparameters, choosing the best one based on the CV score, whereas Randomized Search tries random combinations of hyperparameters within a defined number of iterations. Although faster, it does not guarantee finding the best combination of parameters.
The most important hyperparameters of a Random Forest are:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# Grid Search
# parameters: we keep the default values for min_samples_split (2) and min_samples_leaf (1)
param_grid = [ {'n_estimators': [5, 10, 20], 'max_features': [3, 5, 10],
'max_depth': [10, 25, 50, None], 'bootstrap': [True, False]}]
grid_search_RF = GridSearchCV(forest, param_grid, cv=10, scoring='neg_mean_squared_error')
grid_search_RF.fit(x_train, y_train);
# Find the best model of grid search
print(grid_search_RF.best_estimator_)
# Performance metrics
grid_best= grid_search_RF.best_estimator_.predict(x_train)
errors = abs(grid_best - y_train)
# Calculate mean absolute percentage error (MAPE) on train
mape = np.mean(100 * (errors / y_train))
# Calculate and display accuracy on train
accuracy = 100 - mape
#print result
print('The best model from grid-search (train set) has an accuracy of', round(accuracy, 2),'%')
# RMSE on train
grid_mse = mean_squared_error(y_train, grid_best)
grid_rmse = np.sqrt(grid_mse)
print('The best model from the grid search (train set) has a RMSE of', round(grid_rmse, 2))
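The MAPE/accuracy computation above is repeated for every model evaluation in this notebook; a small helper keeps it consistent (a sketch — `mape_accuracy` is our own name, not part of scikit-learn):

```python
import numpy as np

def mape_accuracy(y_true, y_pred):
    """Return (MAPE, accuracy) in percent, where accuracy = 100 - MAPE.

    Assumes y_true contains no zeros (house prices are strictly positive).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mape = np.mean(100 * np.abs(y_pred - y_true) / y_true)
    return mape, 100 - mape

# quick check: both predictions are off by 10% of the true value
mape, acc = mape_accuracy([100, 200], [110, 180])
print(round(mape, 1), round(acc, 1))  # -> 10.0 90.0
```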
# Randomized Search
from pprint import pprint
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 5, stop = 200, num = 5)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(1, 50, num = 3)]
# Minimum number of samples required to split a node
min_samples_split = [5, 10]
# Bootstrap (method of selecting samples for training each tree)
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'bootstrap': bootstrap
}
pprint(random_grid)
# Random search of parameters, using 10-fold cross-validation
rf_random = RandomizedSearchCV(estimator = forest, param_distributions = random_grid, n_iter = 10, cv = 10, verbose=2, random_state=42, n_jobs = -1, scoring='neg_mean_squared_error')
# Fit the random search model
rf_random.fit(x_train, y_train)
# best random model
print(rf_random.best_estimator_)
# best combination of parameters of random search
rf_random.best_params_
# Performance metrics (MAPE) on train
random_best= rf_random.best_estimator_.predict(x_train)
errors = abs(random_best - y_train)
# Calculate mean absolute percentage error (MAPE) on train
mape = np.mean(100 * (errors / y_train))
# Calculate and display accuracy on train
accuracy = 100 - mape
#print result
print('The best model from the randomized search (train set) has an accuracy of', round(accuracy, 2),'%')
#this is the RMSE on train
final_mse = mean_squared_error(y_train, random_best)
final_rmse = np.sqrt(final_mse)
print('The best model from the randomized search (train set) has a RMSE of', round(final_rmse, 2))
To recap, we have:
Grid Search:
Randomized Search:
Thus, Grid Search performed slightly better than Randomized Search, even though both methodologies achieve high accuracy values.
Before moving on to the test data with the Grid Search best estimator, we will first focus our attention on the "feature importance". This will give us insight into the most important factors in predicting the final Price.
# extract the numerical values of feature importance from the grid search best estimator
importances = grid_search_RF.best_estimator_.feature_importances_
#create a feature list from the original dataset (list of columns)
# What are these numbers? Let's get back to the columns of the original dataset
feature_list = list(X.columns)
#create a list of tuples
feature_importance= sorted(zip(importances, feature_list), reverse=True)
#create two lists from the previous list of tuples
df = pd.DataFrame(feature_importance, columns=['importance', 'feature'])
importance= list(df['importance'])
feature= list(df['feature'])
#see df
print(df)
# list of feature for plotting
x_values = list(range(len(feature_importance)))
# Make a bar chart
plt.figure(figsize=(15,10))
plt.bar(x_values, importance)
# Tick labels for x axis
plt.xticks(x_values, feature, rotation='vertical')
# Axis labels and title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances');
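The list-of-tuples bookkeeping above can be condensed with a pandas Series; this sketch uses hypothetical importances and column names in place of the fitted model's actual output:

```python
import numpy as np
import pandas as pd

# hypothetical values standing in for grid_search_RF.best_estimator_.feature_importances_
importances = np.array([0.35, 0.25, 0.20, 0.15, 0.05])
feature_list = ['Regionname_Southern Metropolitan', 'BuildingArea',
                'Distance', 'Lattitude', 'Rooms']

# one line replaces the zip / sort / unpack steps above
imp = pd.Series(importances, index=feature_list).sort_values(ascending=False)
print(imp)
# imp.plot.bar() would then reproduce the bar chart directly
```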
To sum up:
The Southern Metropolitan region is the most important feature. Indeed, if we recall, this region has the highest median house prices and also the highest density of houses (it includes the City of Melbourne).
BuildingArea and Distance are the second and third most reliable features for predicting the price of a house in Melbourne.
Next, Latitude and Longitude together with Landsize and Rooms complete the "package" of the most important features.
Overall, the type of house seems to have a small impact, especially for "townhouses" (t) or "unit and duplex" (u). The age of a house is slightly more important than Bathroom, while the seasons and HouseClass account for less than 1% of feature importance.
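The claim that the last features contribute under 1% can be checked with a cumulative-importance cutoff; a sketch with hypothetical sorted importances (the real values would come from the fitted model):

```python
import pandas as pd

# hypothetical sorted importances that sum to 1
imp = pd.Series([0.35, 0.25, 0.20, 0.10, 0.06, 0.04],
                index=['Region_SM', 'BuildingArea', 'Distance',
                       'Lattitude', 'Longtitude', 'Rooms'])
cum = imp.cumsum()
# number of top features needed to reach 95% of total importance
n_needed = int((cum < 0.95).sum()) + 1
print('features for 95% importance:', n_needed)  # -> 5
```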
# Evaluate on Test with the grid search best estimator model
final_model = grid_search_RF.best_estimator_
# Predicting test set results
final_pred = final_model.predict(x_test)
final_mse = mean_squared_error(y_test, final_pred)
final_rmse = np.sqrt(final_mse)
print('The final RMSE on the test set is', round(final_rmse, 2))
#calculate accuracy
errors = abs(final_pred - y_test)
# Calculate mean absolute percentage error (MAPE)
mape = np.mean(100 * (errors / y_test))
# Calculate and display accuracy
accuracy = 100 - mape
#print result
print('The best model achieves on the test set an accuracy of', round(accuracy, 2),'%')
# Plot train and test accuracy as max_depth (the maximum depth of each tree in the RF) increases
# Try to identify where overfitting begins
max_depths = np.linspace(1, 50, 50, endpoint=True)
train_results = []
test_results = []
for i in max_depths:
    dt = RandomForestRegressor(max_depth=int(i))
    dt.fit(x_train, y_train)
    # compute accuracy for the train data
    housing_tree = dt.predict(x_train)
    errors = abs(housing_tree - y_train)
    # calculate mean absolute percentage error (MAPE) on train
    mape = 100 * (errors / y_train)
    accuracy = 100 - np.mean(mape)
    # append train accuracy
    train_results.append(accuracy)
    # now again for the test data
    housing_tree = dt.predict(x_test)
    errors = abs(housing_tree - y_test)
    mape = 100 * (errors / y_test)
    accuracy = 100 - np.mean(mape)
    # append test accuracy
    test_results.append(accuracy)
from matplotlib.legend_handler import HandlerLine2D
line1, = plt.plot(max_depths, train_results, 'b', label='Train accuracy')
line2, = plt.plot(max_depths, test_results, 'r', label= 'Test accuracy')
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel('Accuracy score (%)')
plt.xlabel('Tree depth')
plt.show()
This final depth-accuracy plot underlines the overfitting: we can clearly see that after a depth of ~10 the test performance no longer improves, while the train accuracy keeps increasing up to a depth of around 20. This is where overfitting begins!
After this long journey in Melbourne we have collected a number of interesting points and ideas. During the first months of 2018, Melbourne's housing market registered a drop in house sale prices, even though property transaction volumes did not follow the same trend (but bear in mind that we only have the first three months of 2018, and that we deleted observations without a Price from our analysis). Location and size are the most relevant features affecting the Price; on the other hand, who wants to live in the countryside with a lower standard of living? (Did someone say Brooklyn?!)
Apart from that, moving to our statistical analysis, we tried to obtain a global understanding of the housing market using a wide range of methods: from an initial EDA, through a time series analysis, followed by a cluster analysis, to conclude with a regression analysis comparing a general linear model (Linear Regression) with an ensemble model (Random Forest).
A future aim could be to remove the less relevant features and then estimate a new model, comparing the results. Of course, we would lose a bit of predictive power, but at the same time we would gain in ease and interpretability of the model as well as in computational time.
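Scikit-learn's `SelectFromModel` can automate this kind of feature pruning; a sketch on toy data (the threshold and estimator settings are illustrative, not tuned for our dataset):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# toy data with only 3 informative features out of 10
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 n_informative=3, random_state=42)

rf = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)
# keep only features whose importance exceeds the mean importance
selector = SelectFromModel(rf, threshold='mean', prefit=True)
X_reduced = selector.transform(X_demo)
print(X_demo.shape, '->', X_reduced.shape)
```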
Another suggestion could be to use a variant of linear regression, perhaps "Ridge" or "Lasso" regression. The former should perform better when the data suffer from multicollinearity: by adding a degree of bias to the regression estimates, this technique reduces the standard errors, giving more reliable estimates. The latter performs both variable selection and regularisation in order to enhance the prediction accuracy and interpretability of the statistical model it produces (Wikipedia).
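Both penalised regressions are available in scikit-learn; this sketch on toy data (the `alpha` values are arbitrary, not tuned for our dataset) shows the basic pattern, with scaling in a pipeline since penalised models are scale-sensitive:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# toy data standing in for the Melbourne feature matrix
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=15,
                                 random_state=42)

results = {}
for name, model in [('Ridge', Ridge(alpha=1.0)), ('Lasso', Lasso(alpha=0.1))]:
    # scaling matters for penalised regression, so wrap both in a pipeline
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='r2')
    results[name] = scores.mean()
    print(name, 'mean R2:', round(results[name], 3))
```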